Summarizing and querying data generated from multiple scenarios of a data-intensive simulation

ABSTRACT

Simulation data is summarized and queried. A user provides an indication of simulation data that will be subsequently queried. The queried simulation data comprises (i) a set of key attributes, (ii) a set of events, and/or (iii) a set of causality relationships between a plurality of the events. First level summaries summarize simulation executions of scenarios of a combinatorial process and comprise (i) a summary of the frequency distribution of key attribute values, (ii) a timestamp for each event, and (iii) an indication of causality between events observed during the simulation. Second level summaries summarize executions of the given scenario and comprise (i) a consolidated distribution probability for the key attributes, (ii) a frequency distribution of occurrences of the events in a single execution, and (iii) a frequency of observations of the causality between pairs of events. In response to a query, second level summaries of selected scenarios are accessed to retrieve information related to the elements expressed in the query and to produce a third level summary that aggregates information accessed from the second level summaries of the selected scenarios.

FIELD

The field relates generally to simulation of combinatorial processes, such as logistics processes, and more particularly, to techniques for summarizing and querying data related to such simulations.

BACKGROUND

Software applications for simulations of complex systems in discrete time are often memory and storage intensive, leading to computational constraints that may render thorough simulations infeasible. Not only does this kind of software generate a large amount of data, but the problem of extracting meaningful information from massive quantities of data in a reasonable amount of time can be daunting.

Simulation applications are often used in order to study complex systems, where a large number of parameters inherent to these systems needs to be considered. Each combination of parameters corresponds to a scenario that typically needs to be simulated multiple times due to the non-determinism of various parts of the system. Simulations are generated to answer several possible hypotheses, considering subsets of the simulated scenarios. Computations of simulation results for specific scenarios often generate large amounts of data to be stored and take a long time. Moreover, this large volume of data needs to be consolidated or composed in meaningful views, so as to enable domain experts to query the data and derive conclusions. In addition, obtaining query results in a timely manner can be impractical if these computations are triggered only when the queries are executed.

A need therefore exists for improved techniques for dealing with this large amount of data, and for reducing query response times.

SUMMARY

Illustrative embodiments of the present invention provide methods and apparatus for summarizing and querying data generated by data-intensive simulations. In one exemplary embodiment, a method comprises the steps of obtaining a first level summary for each execution of a simulation of a plurality of scenarios of a combinatorial process, wherein each of the plurality of scenarios corresponds to a distinct combination of exploration attributes, wherein the simulation comprises a combination of the exploration attributes comprising a plurality of independent variables that are varied during the simulation and key attributes of the combinatorial process that are a target of the simulation, and wherein a user has provided an indication of simulation data that will be queried following the simulation, wherein the simulation data that will be queried comprises one or more of (i) a set of the key attributes, (ii) a set of events, and (iii) a set of causality relationships between a plurality of the events, wherein each of the first level summaries comprises one or more of (i) a summary of the key attributes indicating a frequency distribution of each attribute value in the one or more of the key attributes, (ii) a timestamp of occurrences of each of the events, and (iii) an indication of whether the causality between the plurality of the events is observed during the simulation; obtaining a second level summary for each of the scenarios, wherein each of the second level summaries summarizes one or more executions of the given scenario and comprises one or more of (i) a consolidated distribution probability for each of the key attributes, (ii) a frequency distribution of occurrences of each of the events in a single execution, and (iii) a frequency of observations of the causality between each pair of the events; and in response to a user query that includes one or more ranges of exploration attributes that restrict the user query to a specific set of selected scenarios to be considered and one or more of (i) the key attributes, (ii) the events, and (iii) the causality between a plurality of the events, performing the following steps: (a) interpreting the user query; (b) accessing second level summaries of the selected scenarios to retrieve the information related to the key attributes, events and causality expressed in the query; and (c) producing as output a third level summary that aggregates the information accessed from the second level summaries of the selected scenarios and contains one or more of (i) probability distribution functions of the key attributes, (ii) probability distribution functions of the number of occurrences of the events, and (iii) composed probabilities of the causality relationships between the events.

In one or more embodiments, the first level summary and the second level summary are generated during the simulation, and the second level summaries are subsequently used to generate the third level summaries in response to one or more of the user queries.

In at least one embodiment, the simulation data that will be queried further comprises one or more hierarchies of one or more aggregation attributes that group one or more of the key attributes at all summarization levels. The step of interpreting the user query optionally further comprises extracting one or more of (i) desired key attributes, (ii) a subset of valid values or intervals for each of the exploration attributes, and (iii) a subset of valid values or intervals for each of the aggregation attributes as defined by the user query. The accessing of the second level summaries of the selected scenarios to retrieve the information can be based on the valid values or intervals.

In an exemplary parallel implementation, the simulation occurs in parallel among one or more compute nodes on a distributed computing infrastructure, and each of the first level summary, the second level summary and the third level summary is generated in parallel among one or more compute nodes of the distributed computing infrastructure. In addition, the first level summaries and second level summaries are optionally computed using volatile in-memory storage and are subsequently persisted in non-volatile disk storage for future use.

Advantageously, illustrative embodiments of the invention provide improved techniques for summarizing and querying data-intensive simulation data. These and other features and advantages of the present invention will become more readily apparent from the accompanying drawings and the following detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates occurrences of an exemplary event in time;

FIG. 2 illustrates a simulation execution and the times in which two simulated events occur;

FIG. 3 illustrates an exemplary first level summarization in accordance with one embodiment of the invention for each execution of a given scenario;

FIG. 4 illustrates an exemplary scheme for storing events and causality in the exemplary first summarization level of FIG. 3 for each exemplary execution;

FIG. 5 illustrates an exemplary scheme for storing events and causality in an exemplary second summarization level, for each exemplary execution;

FIG. 6 is a flow chart illustrating an exemplary implementation of an exemplary simulation data summarization and query process according to one embodiment of the invention;

FIG. 7 illustrates an exemplary architecture for a summarizer according to one embodiment of the invention; and

FIG. 8 is a system diagram of an exemplary computer system on which at least one embodiment of the invention can be implemented.

DETAILED DESCRIPTION

Illustrative embodiments of the present invention will be described herein with reference to exemplary communication, storage, and processing devices. It is to be appreciated, however, that the invention is not restricted to use with the particular illustrative configurations shown. Aspects of the present invention provide methods and apparatus for summarizing and querying data generated by data-intensive simulations. In one or more embodiments, multiple possible interesting scenarios are simulated ahead of time and the results are stored. Only the results that are relevant for each query are evaluated. This approach addresses the need for responsiveness in decision making, avoiding the need to read and analyze, at query execution time, the large amount of data generated by simulation. Thus, one or more embodiments create summaries for each scenario ahead of time, and these summaries are combined on demand, without loss of data accuracy, to respond to future queries. One or more aspects of the invention recognize that the queries to be executed will determine which data, among all data generated by the simulations, will be summarized and stored.

While aspects of the present invention are illustrated in the context of an exemplary supply chain logistics simulation application in the oil and gas industry, the present invention applies in any context where large amounts of simulation data must be collected and subsequently processed to respond to user queries, as would be apparent to a person of ordinary skill in the art.

One or more aspects of the invention recognize that, from a performance viewpoint, central processing unit (CPU) access is considerably faster than disk write speed, which reflects on overall economic models for software utilization. In other words, running a simulation project and storing all its generated data so it can be used in any possible future query can be very expensive. Alternatively, storing only the summaries, i.e., a small subset of all generated data, sufficient to respond to foreseeable queries, can often be cheaper and faster.

In one or more embodiments, an application user defines which key features and events should be monitored and summarized during the execution of the simulation. In addition, users can specify causality relationships among events that are of interest for later analysis. Based on these definitions, simulation data are filtered, grouped, stored and indexed in order to quickly answer queries related to probabilities that take into account only selected subsets of the simulated scenarios.

Software applications for simulations in discrete time are widely used to help business analysts or scientists analyze and predict complex phenomena or behaviors. Typically, understanding and analyzing a specific domain problem requires that these applications be executed many times, sometimes thousands or millions of times, according to various scenarios, which are combinations of business domain input parameters determined by the domain analysts. As a consequence, a large amount of data can be generated by successive simulations during a single project, and often this generated data is too much for the available storage space allocated for a project. Moreover, this large volume of data needs to be consolidated or composed in meaningful views, so as to enable domain experts to query the data and derive conclusions. Additionally, response times need to be short enough in order not to impact the ability to quickly make decisions. Therefore, ways of dealing with this large amount of data need to be created, in order to deal with constraints of both storage and query response times.

U.S. patent application Ser. No. 14/663,630, filed Mar. 20, 2015, entitled “Methods and Apparatus for Evaluation of Combinatorial Processes Using Simulation and Multiple Parallel Statistical Analyses of Real Data,” incorporated by reference herein, describes a framework for the integration of large-scale simulation and Big Data analytics. In this framework, simulation models are created based on domain knowledge about the various phases of the process to be addressed. In addition, Big Data analytics on real world data is applied to embed within the simulation model the observed variability of the different phases in different scenarios. When the model is complete, multiple simulations are executed to generate a population that can be used to predict what can happen in the different scenarios. In order to provide a mechanism for quickly answering queries related to selected scenarios, Big Data analytics is applied again to create global prediction models that can substantially instantaneously provide answers to complex queries such as distribution probabilities of key features, probabilities of events and probabilities of causality between events. In one or more embodiments, the present invention provides a summarization method that corresponds to this application of Big Data analytics.

In one or more embodiments, the techniques disclosed herein address challenges related to: (i) the reduction of required storage, (ii) the efficient creation of summaries, (iii) the efficient and substantially accurate execution of queries, and (iv) the computation of answers related to the probability of events and the causality between events, each discussed hereinafter.

Reduction of Storage

Running a simulation application under several different scenarios and conditions generates large quantities of data. An efficient way to store data is therefore highly desired; more specifically, a way to substantially permanently store only the desirable subsets of the whole data, conveniently summarized and structured to be stored for further queries and analyses.

Efficient Creation of Summaries

In one or more embodiments, the disclosed exemplary summarization process takes place in a timely fashion and does not penalize the simulation process as a whole. It is desirable that the summarization occurs substantially in parallel with the simulation itself, but without creating bottlenecks. As simulation is executed in multiple computation nodes, it is desirable to reduce the communication between simulation and summarization nodes as much as possible. In addition, summarization should be computed avoiding undesirable I/O (input/output), with minimal disk storage of partial results.

Execution of Queries

The queries to be executed are related to subsets of the simulated scenarios. It is then desirable to provide flexibility so that summaries can be accurately composed as if their original results were still available. In addition, it should be possible to compose multiple scenarios where the number of executions of each scenario can vary due to different levels of uncertainty.

Query computation efficiency is an important feature of a decision making tool. In order to reduce the amount of data to be read from disk, it is desirable to group and index summaries, by taking into account the possible queries that can be executed.

Scalability is also important as a single query might be related to thousands of scenarios. Even in this case, the response time for the query should be minimal.

Probability of Events and Causality

Discrete-time simulation software often simulates complex processes in order to identify behavioral patterns. Identifying behavioral patterns often means finding the likelihood of the occurrence of critical events in specific scenarios. In addition, it is usually important for decision making to identify whether the occurrence of one given event during the simulation leads to the occurrence of another event; i.e., if causality exists between two specific events. A means of querying events and their causality probability in the simulation data is therefore important in one or more embodiments.

Summarizing and Querying Data-Intensive Simulation Data

Simulations enable the study of processes that are typically either rare or impractical to study in real life. Such studies consist of running various experiments and exploring a typically large universe of input parameters, so that researchers can analyze their outcome over time. However, in some environments, the generated output data may often reach an impractical scale. In this case, storage constraints become a potential challenge for running multiple simulation experiments. Moreover, generating big volumes of data without efficient ways of querying them so as to obtain fast results is of no use.

One or more embodiments of the invention provide a summarization strategy over simulation data, taking into account future probability queries that might consider any subset of the simulated scenarios. This summarization allows for fast query execution as well as for efficient storage usage. Assume that each simulation execution generates X megabytes of data; thus, a single scenario with n executions has n×X megabytes of data to be read. If the results are summarized into a single file per scenario with Y megabytes, where Y<<X and Y=X/p (p being an arbitrary reduction factor, meant to highlight the fact that the necessary storage is reduced by a factor of n×p, and not just n), each query will read only X/p megabytes per scenario instead of n×X megabytes.
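By way of non-limiting illustration, the storage arithmetic above can be checked with a short Python sketch. The function name `megabytes_read` and the sample values (100 executions, 50 megabytes per execution, a reduction factor of 25) are assumptions chosen only for this example and are not taken from any embodiment.

```python
from typing import Tuple


def megabytes_read(n_executions: int, x_mb: float, p: float) -> Tuple[float, float]:
    """Return (raw, summarized) megabytes read by a query over one scenario."""
    raw = n_executions * x_mb   # n x X: the output of every execution is read
    summarized = x_mb / p       # Y = X / p: one summary file per scenario
    return raw, summarized


# Illustrative values only (assumed, not from the embodiments).
raw, summarized = megabytes_read(n_executions=100, x_mb=50.0, p=25.0)
reduction_factor = raw / summarized   # equals n x p
```

With these assumed values, the data read per scenario drops from 5000 megabytes to 2 megabytes, a factor of n×p = 2500.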

To accomplish both goals (queries with short response times and efficient use of storage), a simulation Domain Expert defines ahead of time (i.e., prior to the simulation run) how the simulation data will be used in subsequent queries, as explained below. It is important to stress that these definitions establish what kind of information should be preserved and structured to answer a large number of possible queries. In this way, all these queries can be answered almost instantaneously without the need for further computation of simulations.

Central to the idea of defining ahead of time what data will be persisted after the simulation is run, as well as how this data should be stored, are the concepts of summarization attributes, and events and causality, each discussed hereinafter.

Summarization Attributes

Summarization Attributes comprise all parameters that are important to the user and will take part in the summarization process. In one or more embodiments, the summarization attributes comprise the following:

-   Exploration Attributes (X): The set of independent variables, i.e., the input parameters that are varied over their respective allowed ranges (usually a discrete domain) and will define different scenarios.
-   Key Attributes (K): The set of dependent variables, i.e., attributes that are the target for analysis (e.g., attributes to be monitored). As used herein, a “set” may comprise zero or more elements. Those attributes are of two kinds:
    -   Key element attributes, representing the attributes of the elements processed by the simulation: for each key element attribute, histograms are generated during the summarization process according to the possible Aggregation Attributes (as discussed further below in conjunction with FIG. 3) in order to evaluate the frequency of occurrences of each attribute value in the processed elements, based on scenarios previously defined by Exploration Attributes;
    -   Key temporal attributes, representing properties assigned to a time instant of the simulated horizon: for each key temporal attribute, a histogram is generated during the summarization process, in order to count the number of time units that each attribute value assumed during the simulation, based on scenarios previously defined by Exploration Attributes.
-   Aggregation Attributes (A): Optional parameters that are used to group key attributes and exhibit aggregated values.

As an example of summarization attributes, consider a supply chain logistics simulation application. The domain expert may want to analyze how the time to deliver supply orders (lead time) varies according to different combinations of fleet size and ship capacity. The expert might also want to know how port occupation with respect to docked ships changes over time. In this case, the Exploration Attributes would be fleet_size and ship_capacity while lead_time would be a key element attribute and port_occupation would be a key temporal attribute. Additionally, the domain expert may want to analyze lead time results grouped by destination. In this case, destination is an Aggregation Attribute.
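By way of illustration only, the supply chain example above could be declared ahead of the simulation run as in the following Python sketch. The class `SummarizationSpec`, its method `queryable`, and the numeric domains given for fleet_size and ship_capacity are hypothetical and are introduced solely for this example; only the attribute names come from the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SummarizationSpec:
    """Hypothetical container for what the Domain Expert declares before the run."""
    exploration: Dict[str, List[int]]                        # X: variables and their domains
    key_element: List[str] = field(default_factory=list)     # K: per-element attributes
    key_temporal: List[str] = field(default_factory=list)    # K: per-instant attributes
    aggregation: List[str] = field(default_factory=list)     # A: optional grouping attributes

    def queryable(self) -> Set[str]:
        """Attribute names that later queries may refer to."""
        return (set(self.exploration) | set(self.key_element)
                | set(self.key_temporal) | set(self.aggregation))


spec = SummarizationSpec(
    exploration={"fleet_size": [5, 10, 15], "ship_capacity": [1000, 2000]},  # assumed domains
    key_element=["lead_time"],
    key_temporal=["port_occupation"],
    aggregation=["destination"],
)
```

Under this sketch, an attribute that was not declared (for instance, a hypothetical `weather` attribute) would not be in `spec.queryable()` and therefore could not be queried after the run.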

Events and Causality

In addition to querying for attributes and their aggregations, the domain expert may also want to record events that occurred in the simulation and identify causality between events. In order to do so, the domain expert also needs to predefine which events to observe and which causality relations between those events are of interest.

An event is defined by the Domain Expert as a label and a computation formula. This computation formula can be arbitrarily complex. In one or more embodiments of the invention, the computation formula shall comprise a logical expression combining key temporal attributes. Consider the set of key temporal attributes K={k₁, k₂, . . . , k_(n)}. An event can be defined as the following logical expression to be checked at substantially every simulation instant t in a simulation execution T:

$\bigwedge_{j \in K} \; \bigwedge_{i=0}^{\delta} \left( k_{j}^{t-i} \; \varphi \; V_{j} \right) \;\Big|\; \varphi \in \{=, <, >, \leq, \geq, \neq\}, \quad V_{j} \in \mathrm{dom}(k_{j}) \;\; \forall j \in K$

where k_(j) ^(t) is the value of the key temporal attribute k_(j)∈K measured at instant t; δ is the substantially minimum time interval for the observation of the event; and operation φ establishes a logical relationship between k_(j) and an arbitrary value V_(j), which is a possible value within the domain dom(k_(j)) of the key attribute.

The aforementioned logical expression is a conjunction of logical expressions between key temporal attributes (at different time instants) and threshold values. For instance, if the simulation experiment has the average lead time (avg_lead_time) in the last 24 h as a key temporal attribute, it is possible to use ‘avg_lead_time>400’ as an event. If it is desirable to check whether the average lead time remained higher than 400 for a specific period of time, it is possible to use a more complex formula such as ∧_(i=0) ^(40) (avg_lead_time^(t-i)>400) that will evaluate to true only at periods when the lead time stayed above the 400 limit for at least 40 instants. An algorithm that detects an event only once during a contiguous window could be desirable, depending on the context.
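By way of non-limiting illustration, the conjunction over a window of δ+1 instants can be evaluated as in the following Python sketch. The function `event_instants`, the dictionary layout of the inputs, and the sample series are assumptions made for this example. The sketch further assumes that instants earlier than δ are skipped because the full window has not yet been observed.

```python
import operator
from typing import Callable, Dict, List, Sequence, Tuple


def event_instants(series: Dict[str, Sequence[float]],
                   conditions: Dict[str, Tuple[Callable[[float, float], bool], float]],
                   delta: int) -> List[int]:
    """Instants t at which every condition held at t, t-1, ..., t-delta."""
    horizon = len(next(iter(series.values())))
    hits = []
    for t in range(delta, horizon):   # assumption: require a fully observed window
        held = all(
            phi(series[name][t - i], threshold)
            for name, (phi, threshold) in conditions.items()
            for i in range(delta + 1)
        )
        if held:
            hits.append(t)
    return hits


# Assumed sample values of the key temporal attribute avg_lead_time.
avg_lead_time = [390, 410, 420, 430, 405, 399, 450]
condition = {"avg_lead_time": (operator.gt, 400)}

simple_event = event_instants({"avg_lead_time": avg_lead_time}, condition, delta=0)
sustained_event = event_instants({"avg_lead_time": avg_lead_time}, condition, delta=2)
```

In this sketch, `simple_event` lists every instant at which the threshold is exceeded, while `sustained_event` lists only the instants at which it has been exceeded for three consecutive instants.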

FIG. 1 illustrates occurrences of an exemplary event in time. As shown in FIG. 1, an event e_(x) is observed in time instants 50 through 54, inclusive. More complex formulas defining the exemplary event could also include different key attributes and different logical relationships for the same key temporal attribute.

In addition to identifying events, users may also want to know the probability that one specific event would cause another one. The Domain Expert can define which pairs of events are of interest. Given a set of all possible events E={e₁, e₂, . . . , e_(n)}, the system can evaluate the causality C(e_(i),e_(j)) between every pair of events (e_(i),e_(j)), in a simulation execution T if, for every instant t in the execution, the occurrences of e_(i) and e_(j) are known to be either true or false. For a discussion of suitable techniques for computing causality between events and the inference of causal relationships in time series data, see, for example, Samantha Kleinberg, “A Logic for Causal Inference in Time Series With Discrete and Continuous Variables,” IJCAI Proc. Int'l Joint Conf. on Artificial Intelligence, Vol. 22, No. 1 (2011), incorporated by reference herein in its entirety.

FIG. 2 illustrates a simulation execution and the times in which events e_(i) and e_(j) occur (e.g., events e_(i) and e_(j) are true). In the example of FIG. 2, the lists of times when events e_(i) and e_(j) occur are the following:

e_(i)={10, 25, 39, 72}

e_(j)={30, 44, 67}

In FIG. 2, the computation of causality C(e_(i),e_(j)) on simulation execution T returns a value that can be either true or false, meaning, respectively, that T supports or does not support the hypothesis that e_(i) is a cause of e_(j). Furthermore, a false value is distinguished from an absence of a value (the latter meaning that simulation execution T does not indicate whether or not e_(i) causes e_(j)). A value is absent when no instances of e_(i) happen before e_(j) in simulation execution T.

If event e_(i) in fact happens before event e_(j) in simulation execution T, then event e_(i) is considered as a prima facie cause of event e_(j) in the case that event e_(j) is more likely to happen following event e_(i) than on its own. Then, a significance of event e_(i) as a cause of event e_(j) is computed in comparison to every other possible cause of event e_(j) in simulation execution T. If this significance is greater than a threshold value, C(e_(i),e_(j)) yields a true value; otherwise it yields a false value.

It is noted that, in one or more exemplary embodiments, all of these conditions are formulated in a temporal logic formalism (called probabilistic computation tree logic) in which time window constraints are made explicit. Therefore, given a time window w, when checking for “e_(i) before e_(j)”, it is determined whether instances of e_(j) occur at most w instants after e_(i) in T.
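By way of illustration only, the time window check for “e_(i) before e_(j)” can be sketched in Python using the occurrence times of FIG. 2. The function `follows_within` and the window values are assumptions for this example. The sketch covers only the temporal precedence test; it does not compute the prima facie or significance conditions of the referenced causality technique, and it assumes that “after” means strictly later.

```python
from typing import Iterable, List, Tuple


def follows_within(cause_times: Iterable[int],
                   effect_times: Iterable[int],
                   w: int) -> List[Tuple[int, int]]:
    """Pairs (t_i, t_j) where the effect occurs after the cause, at most w instants later."""
    effects = sorted(effect_times)
    return [(ti, tj)
            for ti in sorted(cause_times)
            for tj in effects
            if 0 < tj - ti <= w]   # assumption: strictly after, within the window


# Occurrence times from the example of FIG. 2.
e_i = [10, 25, 39, 72]
e_j = [30, 44, 67]

pairs_w5 = follows_within(e_i, e_j, w=5)
pairs_w4 = follows_within(e_i, e_j, w=4)
```

With an assumed window of 5 instants, two occurrences of e_(j) follow an occurrence of e_(i); with a window of 4 instants, none do, in which case the execution would carry no causality value for this pair.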

Multi-Level Summarization

The set of Exploration Attributes X={x₁, x₂, . . . , x_(n)} is the list of independent variables to be explored in the simulation. Each variable x_(i)∈X represents a set of discrete values to be explored by the simulation application. The set of possible values for x_(i) is referred to as dom(x_(i)) or the domain of x_(i). A scenario is defined as an execution instance that receives as input a distinct combination of input values for each x_(i)∈X. Thus, the set of all possible combinations of input parameters is the Cartesian product S=dom(x₁)×dom(x₂)× . . . ×dom(x_(n)).
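By way of non-limiting illustration, the scenario set S can be enumerated with the Cartesian product from the Python standard library. The dictionary name `domains` and the particular domain values (two exploration attributes with three and five values, respectively) are assumptions chosen for this sketch.

```python
from itertools import product

# Assumed domains for two exploration attributes x1 and x2.
domains = {"x1": [1, 2, 3], "x2": [10, 20, 30, 40, 50]}

names = list(domains)
# S = dom(x1) x dom(x2): one scenario per distinct combination of input values.
scenarios = [dict(zip(names, values))
             for values in product(*(domains[name] for name in names))]
```

Here `scenarios` holds 3×5 = 15 entries, the first being `{"x1": 1, "x2": 10}`; each entry would then be simulated a number of times.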

Summarization First Level:

Generally, the exemplary first level summary provides a summary for each execution of each scenario. For each scenario, s_(i)∈S, several executions of the simulation are performed. Each scenario is comprised of a unique combination of Exploration Attributes. Given the fact that most available simulation applications are non-deterministic or pseudo-random, different executions in the same scenario can lead to different outputs. Thus, it is often desirable to simulate each scenario a number of times until the inherent variability of the scenario is captured.

FIG. 3 illustrates an exemplary first level summarization 300 in accordance with one embodiment of the invention for each execution 310-1 through 310-N of a given scenario. Generally, the exemplary first level summarization 300 measures how the results vary for each execution 310 of the given scenario. For each simulation execution 310 of a specific scenario (e.g., a subset of Exploration Attributes values), the key attributes (k₁ to k_(p)) in K to be monitored are summarized by means of histograms 320-1 through 320-N, as shown in FIG. 3. In the example of FIG. 3, there are p distinct key attributes (k).

The exemplary first level summarization 300 is conceptually a hierarchical key-value store where each key is the combination of the values of the aggregation attributes (a) in A, plus a key attribute. As an example, in FIG. 3, the key for the first histogram 320-1 would be (a₁₁, a₂₁, k₁), where a₁₁ and a₂₁ are, respectively, values in the domains of Aggregation Attributes a₁ and a₂. The number of histogram bins (ranges of values to consider for aggregation) can also be defined by the Domain Expert.

It is noted that increasing the number of aggregation attributes (the size of the set A) in the summary structure might negatively affect the query response time. This is due to the fact that the more the summary results are split into groups, the more computational effort is needed in order to reconstruct different views (notice that in FIG. 3, n is the number of possible values for the aggregation attribute a₁ while m is the number of possible values for a₂).

Subtotal histograms 330-1 through 330-N are optionally also calculated for each execution 310 and stored for each aggregation attribute (a) to improve query response time. The subtotals 330 are also indexed by the key attributes (k). For instance, if it is desirable to access k₁ aggregating only by a₁ (independent of a₂), the key for key attribute k₁ in the first histogram 330-1 would be (a_(n),k₁).

Both key element and key temporal attributes can be aggregated via aggregation attributes (a) in A as just described, and computed histograms for key attributes can be transformed into distribution probabilities by taking into consideration the total number of occurrences. In the case of key temporal attributes, the number of occurrences corresponds to the number of instants of the simulation horizon. In the case of key element attributes, the number of occurrences corresponds to the number of processed elements.
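By way of illustration only, a first level summary for a single execution can be sketched in Python as a key-value store of histograms keyed by the aggregation attribute values plus the key attribute. The functions `first_level_summary` and `to_distribution`, the aggregation attribute `ship_type`, the fixed bin width, and the sample records are all assumptions for this example; subtotal histograms are omitted for brevity.

```python
from collections import Counter, defaultdict


def first_level_summary(records, agg_attrs, key_attr, bin_width):
    """Histograms keyed by (aggregation values..., key attribute) for one execution."""
    histograms = defaultdict(Counter)
    for rec in records:
        bin_start = (rec[key_attr] // bin_width) * bin_width   # lower edge of the bin
        key = tuple(rec[a] for a in agg_attrs) + (key_attr,)
        histograms[key][bin_start] += 1
    return histograms


def to_distribution(histogram):
    """Turn counts into probabilities using the total number of occurrences."""
    total = sum(histogram.values())
    return {bin_start: count / total for bin_start, count in histogram.items()}


# Assumed processed elements (supply orders) of one execution.
orders = [
    {"destination": "port_a", "ship_type": "small", "lead_time": 120},
    {"destination": "port_a", "ship_type": "small", "lead_time": 135},
    {"destination": "port_a", "ship_type": "small", "lead_time": 260},
    {"destination": "port_b", "ship_type": "large", "lead_time": 410},
]
summary = first_level_summary(orders, ["destination", "ship_type"], "lead_time", bin_width=100)
port_a_small = to_distribution(summary[("port_a", "small", "lead_time")])
```

In this sketch, lead_time is a key element attribute, so the total used for normalization is the number of processed orders in each group.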

FIG. 4 illustrates an exemplary scheme 400 for storing events and causality in the exemplary first summarization level 300 of FIG. 3 for each exemplary execution 410-1 through 410-N. In the exemplary scheme 400, the information that is saved in one or more embodiments to answer queries about events and causality corresponds to the following annotations:

-   For events: a list of timestamps of occurrences of each event; and
-   For causality: a Boolean value indicating whether the specific causality is verified or not in the execution, according to a causality metric specified by the Domain Expert (see, for example, Samantha Kleinberg, “A Logic for Causal Inference in Time Series With Discrete and Continuous Variables,” IJCAI Proc. Int'l Joint Conf. on Artificial Intelligence, Vol. 22, No. 1 (2011), incorporated by reference herein).

In one or more embodiments, a True value is stored whenever, according to the metric, there is evidence of causality between the two events, and a False value is stored whenever there is evidence of not having causality between the two events. However, if the metric is not able to determine with confidence whether the causality existed or not, no causality value is stored. This is the case in the example of FIG. 4, where in Execution 410-N, no cases of e₂ are observed and, therefore, no causality information C(e₁,e₂) or C(e₂,e₁) between e₁ and e₂ is stored.
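By way of non-limiting illustration, the per-execution annotations for events and causality can be sketched in Python as follows. The class `ExecutionAnnotations`, its method `record_causality`, and the sample timestamps are hypothetical. The sketch assumes that an undetermined verdict is represented by `None` and is simply not stored, so that an absent entry is distinguishable from a stored False.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ExecutionAnnotations:
    """Hypothetical first level annotations for one execution."""
    event_times: Dict[str, List[int]] = field(default_factory=dict)
    causality: Dict[Tuple[str, str], bool] = field(default_factory=dict)

    def record_causality(self, cause: str, effect: str, verdict: Optional[bool]) -> None:
        """Store True or False; store nothing when the metric is undetermined (None)."""
        if verdict is not None:
            self.causality[(cause, effect)] = verdict


# An execution in which e2 is never observed: no causality value is stored.
execution_n = ExecutionAnnotations(event_times={"e1": [12, 48], "e2": []})
execution_n.record_causality("e1", "e2", None)
execution_n.record_causality("e2", "e1", None)

# An execution with evidence for C(e1, e2) and against C(e2, e1).
execution_1 = ExecutionAnnotations(event_times={"e1": [10], "e2": [14]})
execution_1.record_causality("e1", "e2", True)
execution_1.record_causality("e2", "e1", False)
```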

Summarization Second Level:

As the first level summarization 300 runs for each execution within a scenario, all results can be reduced by aggregation to a single summary for the complete scenario. For instance, if 100 executions are run for each of 10 scenarios, the final summarization will consist of 10 data sets, one per scenario.

In one or more embodiments, this reduction by aggregation is made for each scenario in a reduce-by-key fashion, described as follows:

-   For each key in the summary, the histograms of the various simulation executions are combined to generate consolidated distribution probabilities. When there are many simulation executions, a distribution is generated for each execution and the distributions are combined, assuming an equal probability. When there are few simulation executions, however, an execution with outliers can lead to a bias. In order to avoid the bias, all occurrences can be combined in a single histogram, reducing the weight of outliers, and then generating a probability distribution.
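By way of illustration only, the two ways of combining per-execution histograms described above can be contrasted in the following Python sketch. The function names `combine_equal_weight` and `combine_pooled` and the two sample histograms are assumptions for this example.

```python
from collections import Counter


def normalize(histogram):
    """Turn counts into probabilities."""
    total = sum(histogram.values())
    return {b: c / total for b, c in histogram.items()}


def combine_equal_weight(histograms):
    """Average the per-execution distributions; every execution weighs the same."""
    dists = [normalize(h) for h in histograms]
    bins = set().union(*dists)
    return {b: sum(d.get(b, 0.0) for d in dists) / len(dists) for b in bins}


def combine_pooled(histograms):
    """Pool all occurrences into a single histogram, then normalize."""
    pooled = Counter()
    for h in histograms:
        pooled.update(h)   # Counter.update adds counts
    return normalize(pooled)


# Assumed histograms: the second execution has few occurrences and an outlier at bin 900.
runs = [Counter({100: 9, 200: 1}), Counter({100: 1, 900: 1})]
equal_weight = combine_equal_weight(runs)
pooled = combine_pooled(runs)
```

In this assumed example, equal weighting assigns the outlier bin a probability of 0.25, whereas pooling assigns it 1/12, which reflects the reduced weight of outliers when occurrences are combined in a single histogram.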

FIG. 5 illustrates an exemplary scheme 500 for storing events and causality in the exemplary second summarization level, for each exemplary execution 510-1 through 510-N.

-   -   For each event (e), a histogram 520 is computed for the number of occurrences of the event in a single execution. For instance, suppose scenario #1 was executed 100 times. One possible histogram for event e₁ occurrence could be: {0 times in 10 executions, 10 times in 80 executions, 20 times in 10 executions}. Such a histogram is then transformed into a distribution probability for the number of occurrences of the event.
    -   For causality (C), for example, the frequency of True values in the causality annotations between two events e_(i) and e_(j) 530. This frequency corresponds to the percentage of True values among the executions that reported either a True or False value for causality.
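As a sketch of these two second-level computations, the snippet below reproduces the histogram example for event e₁ and the causality frequency. The function names and the use of `None` to represent "no causality value stored" are assumptions made for illustration.

```python
from collections import Counter


def occurrence_distribution(counts_per_execution):
    """Distribution of the number of occurrences of one event per execution."""
    hist = Counter(counts_per_execution)
    n = len(counts_per_execution)
    return {occurrences: runs / n for occurrences, runs in sorted(hist.items())}


def causality_frequency(annotations):
    """Fraction of True among executions that stored True or False.

    Executions with no stored value (None) are ignored; returns None when
    no execution reported a value.
    """
    decided = [a for a in annotations if a is not None]
    if not decided:
        return None
    return sum(decided) / len(decided)


# 100 executions of scenario #1: e1 occurred 0 times in 10 executions,
# 10 times in 80 executions and 20 times in 10 executions.
counts = [0] * 10 + [10] * 80 + [20] * 10
dist = occurrence_distribution(counts)

# Hypothetical causality annotations from four executions.
freq = causality_frequency([True, True, False, None])
```

Here `dist` assigns probabilities 0.1, 0.8 and 0.1 to 0, 10 and 20 occurrences, and `freq` is two thirds because the undecided execution is excluded from the denominator.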

When all the results from the multiple executions 510 of a scenario are summarized, the computed probabilities for key attributes, events and causalities are persisted and the original data generated by the simulation application are optionally discarded.

Summarization Third Level:

After the simulation process is finished and there is one summary per scenario, the user can execute queries over the aggregated data.

It is noted that only queries that include attributes originally defined prior to the simulation execution in the sets X, A and K (by the Domain Expert) are allowed to be executed. This is important for the understanding of this approach: defining which data is important for analysis defines which aggregated data will be persisted to be subsequently used for queries. Therefore, for each query execution, a third level summarization occurs on-demand (e.g., in response to the user query).

For instance, let a certain simulation run with exploration attributes x₁ with range {1, 2, 3} and x₂ with range {10, 20, 30, 40, 50}; let also k₁ and k₂ be key temporal attributes. Suppose query Q1 is issued to show the distribution probability for k₁, with x₁ fixed to {1} but with x₂ considering any of the lower values {10, 20}. The query consults the stored summarizations of scenario (1, 10) and scenario (1, 20) and composes the corresponding distribution probabilities for k₁. Such a composition can either assume equal probability for the scenarios or assume that they have specific probabilities provided by the user.

A query can also correspond to the computation of the distribution probability for the number of occurrences of a certain event when a subset of the possible scenarios is considered. In this case, the stored distribution probabilities of the corresponding scenarios are composed. This is done in the same way the distribution probabilities of key attributes are composed. Finally, a query can correspond to the probability of causality between events. In this case, there is a single probability value for the causality in each scenario to be considered. The values for each scenario are recovered and then composed. In both cases, computation can assume either equal probabilities for the scenarios or user-defined probabilities for each scenario.
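The composition step for query Q1 can be illustrated with a weighted mixture of the stored scenario distributions. This is only a sketch: the function name `compose`, the dictionary representation of a distribution and the numeric values for scenarios (1, 10) and (1, 20) are hypothetical.

```python
def compose(distributions, weights=None):
    """Mix per-scenario distributions into a single distribution.

    With weights=None every scenario is assumed equally probable;
    otherwise the user-provided weights are normalized to sum to one.
    """
    if weights is None:
        weights = [1.0] * len(distributions)
    total = sum(weights)
    result = {}
    for dist, weight in zip(distributions, weights):
        for value, p in dist.items():
            result[value] = result.get(value, 0.0) + p * weight / total
    return result


# Hypothetical stored distribution probabilities of k1.
k1_scenario_1_10 = {5: 0.5, 6: 0.5}
k1_scenario_1_20 = {6: 1.0}

equal = compose([k1_scenario_1_10, k1_scenario_1_20])
weighted = compose([k1_scenario_1_10, k1_scenario_1_20], weights=[3, 1])
```

The same function applies to event-occurrence distributions. For causality, where each scenario stores a single probability, the composition reduces to a weighted average of those values.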

Query Definition and Execution

When a User defines the key attributes K={k₁, k₂, . . . , k_(n)}, exploration attributes X={x₁, x₂, . . . , x_(m)} and aggregation attributes that he or she wants to consider in summarization, as well as events and causalities involving these attributes, the user is choosing the data that will be persisted, as well as how these data will be persisted; moreover, pre-computed aggregations will be stored to be used in queries.

Consider the overseas supply chain logistics example, where the exploration attributes are X={‘fleet_size’, ‘ship_capacity’}. The user defines two key temporal attributes K={‘port_occupation’, ‘lead_time’} (where the first one corresponds to the percentage of occupied docks in the port at each instant and the second one corresponds to the average lead time observed in the last 24 h), one single aggregation attribute A={‘material_type’}, and two events:

e_(port_overload)=‘port_occupation’>0.75;

e_(delay)=‘lead_time’>240

The user also wants to compute the causality between the two events:

C(e_(port_overload),e_(delay)).
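One way to picture these definitions is as named predicates evaluated over the time-ordered records of a single execution, yielding the per-event timestamp lists kept at the first level. The record layout (a dictionary with a `t` field) and the choice to count every record that satisfies a predicate as one occurrence are assumptions of this sketch.

```python
# Event predicates taken from the definitions above.
EVENTS = {
    "port_overload": lambda record: record["port_occupation"] > 0.75,
    "delay": lambda record: record["lead_time"] > 240,
}


def event_timestamps(records, events=EVENTS):
    """Collect, per event, the timestamps of the records where it holds."""
    stamps = {name: [] for name in events}
    for record in records:
        for name, predicate in events.items():
            if predicate(record):
                stamps[name].append(record["t"])
    return stamps


# Hypothetical records of one execution.
records = [
    {"t": 0, "port_occupation": 0.60, "lead_time": 200},
    {"t": 1, "port_occupation": 0.80, "lead_time": 230},
    {"t": 2, "port_occupation": 0.90, "lead_time": 250},
]
stamps = event_timestamps(records)
```

A causality metric such as the one cited earlier would then take the two timestamp lists as input to decide the value stored for C(e_(port_overload),e_(delay)).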

The queries the user could perform, based on the previously defined attributes, are shown below, described in some hypothetical SQL-like examples, with a distinctive particularity: in relational databases, the WHERE/AND clauses define a predicate that returns a horizontal subset of the relation. However, since only distributed summaries are stored in one or more embodiments, the WHERE/AND clauses in these examples represent substantially exactly the third level summarizations among the different scenario summaries that were already previously created:

-   -   Query: distribution probability for the key attribute ‘port        occupation’, restricting the range of values of ‘fleet size’ and        ‘ship capacity’ explored on different scenarios, according to        the domain of the exploration attributes:

SELECT get_distribution_probability(‘port_occupation’)

FROM <simulation_run#N>

WHERE ‘fleet_size’ IN [20, 40]

AND ‘ship_capacity’ IN [500, 700]

Returns: one dataset corresponding to the aggregated distribution probability:

aggr(fleet_size=[20,40], ship_capacity=[500, 700])

-   -   Query: occurrences of defined events and causality, restricting        the range of values of ‘fleet size’ and ‘ship capacity’:

SELECT get_distribution_probability(‘port_overload’)

FROM <simulation_run#N>

WHERE ‘fleet_size’ IN [20]

AND ‘ship_capacity’ IN [500, 700]

Returns: one dataset for the case below:

aggr(fleet_size=[20], ship_capacity=[500, 700])

-   -   Query: histogram for the key attribute ‘lead time’, grouped by        the ‘material type’ aggregation attribute, also restricting by        the range of values of ‘fleet size’ and ‘ship capacity’:

SELECT ‘material_type’, get_distribution_probability(‘lead_time’)

FROM <simulation_run#N>

WHERE ‘fleet_size’ IN [20, 40]

GROUP BY ‘material_type’

Returns: m datasets, where m is the number of distinct material types, for the case below:

aggr(fleet_size=[20,40], material_1)

aggr(fleet_size=[20,40], material_2)

. . .

aggr(fleet_size=[20,40], material_m)

FIG. 6 is a flow chart illustrating an exemplary implementation of an exemplary simulation data summarization and query process 600 according to one embodiment of the invention. As shown in FIG. 6, the exemplary simulation data summarization and query process 600 initially obtains, during step 610, prior to an execution of a simulation of a combinatorial process, an indication of simulation data that will be queried following the simulation. As noted above, the simulation comprises a combination of exploration attributes comprising a plurality of independent variables that are varied during the simulation and key attributes of the combinatorial process that are a target of the simulation. The simulation data that will be queried comprises one or more of (i) a set of the key attributes, (ii) a set of events, and (iii) a set of causality relationships between a plurality of the events.

During step 620, the exemplary simulation data summarization and query process 600 simulates a plurality of scenarios of the combinatorial process. Each scenario corresponds to a distinct combination of the exploration attributes.

The exemplary simulation data summarization and query process 600 generates a first level summary during step 630 for each execution of each of the scenarios. Each first level summary comprises a summary of the key attributes indicating a frequency distribution of each attribute value in the key attributes, a timestamp of occurrences of each of the events, and/or an indication of whether the causality between the plurality of events has occurred during the simulation.

During step 640, the exemplary simulation data summarization and query process 600 generates a second level summary for each scenario. Each second level summary summarizes one or more executions of the given scenario and comprises a consolidated distribution probability for each of the key attributes, a frequency distribution of occurrences of each of the events in a single execution, and/or a frequency of occurrences of the causality between each pair of events.

A test is performed during step 650 to determine if a user query is received. The user query typically includes one or more ranges of exploration attributes that restrict the query to a specific set of selected scenarios to be considered and (i) the key attributes, (ii) the events, and/or (iii) the causality between a plurality of events.

Once it is determined during step 650 that a user query is received, the user query is interpreted during step 660. Thereafter, the exemplary simulation data summarization and query process 600 accesses second level summaries of the selected scenarios during step 670 to retrieve the information related to the key attributes, events and causality expressed in the query.

Finally, the exemplary simulation data summarization and query process 600 produces as output a third level summary during step 680 that aggregates the information accessed from the second level summaries of the selected scenarios and contains (i) probability distribution functions of key attributes, (ii) probability distribution functions of the number of occurrences of events, and/or (iii) composed probabilities of the causality relationships between events.

Query answers thus typically consist of probability distributions of key features, probabilities of critical events and probabilities of causality between events that may occur in datasets generated by the simulations.

FIG. 7 illustrates an exemplary architecture for a summarizer 700 according to one embodiment of the invention. Generally, the summarizer 700 starts the simulation by firing a predefined number of executions for each predefined scenario. The summarizer 700 is responsible for the overall task of reading all generated data, performing the summarizations at the first and second levels, and persisting the aggregated data, by calling the more specific components.

In one exemplary implementation, the summarizer 700 is an application running on a node in a computer cluster with several nodes, and it is the main orchestrator for the whole process. A user provides the exemplary summarizer 700 with four input parameters crucial to the summarization process:

-   -   A pointer to the location of the simulation application 740 and its configuration parameters;
    -   The user-defined list of attributes, events and causalities;
    -   A pointer to a location of an in-memory database where the second-level aggregation will occur; and
    -   A list of addresses of cluster nodes, as well as the number of cores on each node that it can use to start the execution of summarizer engines.

As shown in FIG. 7, the exemplary summarizer 700 comprises a plurality of summarizer engines 710-1 through 710-N in a first stage for generating the first level summaries 300. In at least one embodiment, the summarizer 700 is programmed to assign a number of processors (or processor cores) in the cluster for each execution of each simulation, by starting a summarizer engine 710 for each of them on a designated core.

For each execution of each scenario, a summarizer engine 710 is created, encompassing the simulation application 740, a worker 720 and a logger 730. As shown in FIG. 7, several summarizer engines 710 can work in parallel. For example, if a scenario is set to run 100 times, the exemplary summarizer 700 will instantiate 100 instances of the summarizer engines 710 substantially simultaneously.

Each worker 720 reads the simulation data being generated by the simulation application 740 and converts the simulation data into a list of records ordered in time. The logger components 730 consume each record being generated by the worker 720 in order to perform the aggregation for the key attributes and to identify and count the events.

The logger 730 is the component responsible for the first level aggregation, discussed above. The output of the logger 730 is a set of ordered lists of records, already aggregated by key attributes, for each execution of the different scenarios. Thus, if 10 executions of each of 5 different scenarios are performed, this summarization happens 50 times.

In one exemplary implementation, each summarizer engine 710 starts a simulation execution on a designated cluster core, and it starts a paired instance of a worker 720 and a logger 730. During the simulation execution, the simulation application 740 generates a set of data frames in local memory. These data frames are specific to the simulation application.

In at least one embodiment, the worker object 720 reads all data frames generated by the simulation application on-the-fly, i.e., substantially as the data is being produced by the simulation. The worker object 720 applies programmable transformation rules to the read data, converting the information contained in the data frames into a single in-memory log. This single in-memory log should contain the needed data as previously determined by the set of all input attributes and it is shared with the logger 730. The logger 730 is responsible for reading the log as it is being produced and aggregating the histograms at the same time the simulation is running.
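The streaming relationship between worker and logger can be sketched with a Python generator, so that each record is aggregated as soon as it is produced rather than after the simulation ends. The frame layout, the single transformation rule and the fixed-width binning are assumptions chosen for illustration; a real deployment would share the log between concurrently running components.

```python
from collections import Counter


def worker(frames):
    """Hypothetical transformation rule: keep only the needed fields and
    emit one time-stamped log record per simulation data frame."""
    for frame in frames:
        yield {"t": frame["time"], "lead_time": frame["lead_time"]}


def logger(records, key, bin_width):
    """Consume records one by one and update the histogram of `key`."""
    hist = Counter()
    for record in records:
        bin_start = (record[key] // bin_width) * bin_width
        hist[bin_start] += 1
    return hist


# Hypothetical frames; extra fields are dropped by the worker.
frames = [
    {"time": 0, "lead_time": 100, "ship": "A"},
    {"time": 1, "lead_time": 130, "ship": "B"},
    {"time": 2, "lead_time": 250, "ship": "A"},
]
lead_time_hist = logger(worker(frames), key="lead_time", bin_width=50)
```

Because `worker` is a generator, no intermediate list of records is materialized; only the histogram grows with the run.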

When a single execution finishes, the worker 720 signals the logger 730 that the logger 730 can start to compute causality between events. When the logger 730 finishes its work, i.e., when aggregations and causalities are done, the logger 730 asks the summarizer engine 710 whether it can aggregate its computed data into the in-memory database. If the answer is yes, second level aggregation is performed, and the local shared memory is then discarded.

The second level summarization happens substantially as soon as the first level summarization is finished; in this step, all aggregated datasets that reside in separate summarizer engines 710 are reduced by aggregation into one single dataset per scenario by the scenario summarizer 750, which generates distribution probabilities and probabilities of causality for each scenario, as discussed above.

The summarizer engine 710 collects all logger 730 requests for proceeding with the second level, allowing them to contact the in-memory database and dump their local summaries. Therefore, it can gauge whether one simulation execution is taking considerably longer than the others. In this case, it will choose an idle core to re-start this simulation. Whichever of the two simulations finishes first will be allowed to proceed to second level aggregation. The other will be aborted, and all work done by its Worker-Logger pair is discarded.
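The text does not specify how "considerably longer" is measured. Purely as an assumption, the sketch below flags an execution for re-start when its elapsed time exceeds a multiple of the median duration of the executions that have already finished; the function name, the factor of 2 and the execution identifiers are all hypothetical.

```python
from statistics import median


def find_stragglers(finished_durations, running_elapsed, factor=2.0):
    """Return the ids of running executions that look like stragglers.

    finished_durations: durations of executions that already completed.
    running_elapsed: mapping from execution id to elapsed time so far.
    """
    if not finished_durations:
        return []  # nothing to compare against yet
    limit = factor * median(finished_durations)
    return [exec_id for exec_id, elapsed in running_elapsed.items()
            if elapsed > limit]


# Hypothetical timings in seconds: the median finished run took 11 s,
# so anything running for more than 22 s is flagged.
stragglers = find_stragglers([10, 12, 11], {"exec-7": 15, "exec-9": 40})
```

Each flagged execution would then be started again on an idle core, with the first of the two copies to finish proceeding to second level aggregation.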

The scenario summarizer 750 is therefore responsible for accepting and aggregating all datasets generated from the summarizer engines 710 of a single scenario. This newly aggregated dataset can be, for example, a file inside a Distributed File System or part of an in-memory database, and it is a task of the scenario summarizer 750 to persist the file. The scenario summarizer 750 optionally also provides an indexing scheme that will make aggregations easier.

In one or more embodiments, there exists one scenario summarizer 750 per scenario. As noted above, a scenario is one combination of values of the exploration attributes.

The exemplary query engine is responsible for:

-   -   Interpreting a query definition;
    -   Selecting scenarios to use, for specific queries;
    -   Aggregating the selected scenario summaries, possibly considering user-defined probabilities for each scenario; and
    -   Producing the output datasets.

In one or more embodiments, interpreting a query means extracting, from a textual description:

-   -   The desired key attribute k∈K;
    -   The subset of valid values V_(i)⊆dom(x_(i)) for each exploration attribute x_(i)∈X restricting the query; and
    -   The set of Aggregation Attributes A′⊆A.

Once the query engine knows V={V₁, V₂, . . . , V_(m)}, the query engine can check the scenarios that used the exploration attribute values in V and select the correct scenario summary instances that will be used in the third level aggregation.

With the correct scenario summaries at hand, the query engine performs the third level summarization. In order to do that, the query engine searches all scenario summaries for the keys that respect the attributes in A′ and k. The query engine then aggregates the obtained values taking into account the probability of each scenario.
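The scenario-selection step can be sketched as a filter over scenario keys. This sketch assumes that each stored summary is indexed by a tuple of exploration attribute values and that an attribute left out of the query is unrestricted; the function name and the data layout are hypothetical.

```python
from itertools import product


def select_scenarios(scenario_summaries, exploration_names, valid_values):
    """Keep the scenarios whose exploration values fall in the queried sets.

    scenario_summaries: mapping from a tuple of exploration values to a summary.
    exploration_names: attribute names, in the same order as the tuple.
    valid_values: mapping from attribute name to its set of allowed values.
    """
    selected = {}
    for key, summary in scenario_summaries.items():
        assignment = dict(zip(exploration_names, key))
        if all(assignment[name] in allowed
               for name, allowed in valid_values.items()):
            selected[key] = summary
    return selected


# Hypothetical store with 3 x 3 scenarios and placeholder summaries.
names = ("fleet_size", "ship_capacity")
store = {key: {} for key in product((20, 40, 60), (500, 700, 900))}

chosen = select_scenarios(
    store, names, {"fleet_size": {20, 40}, "ship_capacity": {500, 700}}
)
```

The summaries returned for the selected keys are then the inputs to the third level aggregation.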

In at least one exemplary implementation, the query engine accepts user queries that specify a subset of simulated scenarios to be considered. The query engine decides which scenario files need to be loaded and aggregated, based on the exploration attributes. The query engine then aggregates these scenarios, possibly using a user-defined probability for each scenario. Distribution probabilities for key attributes or events or probabilities of causality are computed. Finally, the query engine exposes the final query results either as memory streams or as files persisted in a file system. It also caches the most recently loaded scenarios, to speed up further query responses.

In one or more exemplary implementations, an in-memory database runs on a core which is separate from all summarizer engines 710. Each logger 730 sends its aggregated data to the in-memory database, which performs the second level aggregation per scenario. Each scenario, i.e., combination of exploration attributes, is a key by which the in-memory database will aggregate the information sent by each logger 730 in a scenario dataset. This key is used for indexing, in order to speed up the aggregation process. Each scenario dataset will be persisted as a file that can be chosen among common formats, provided the implementation of the query engine can read this format. For instance, the scenario datasets could be written in JSON text files or in a proprietary format that the in-memory database can read.
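A toy stand-in for this keyed aggregation is shown below, using a dictionary indexed by the scenario tuple in place of an in-memory database and JSON as the persisted format. The class name `ScenarioStore`, its methods and the JSON layout are assumptions for illustration; no particular database product is implied.

```python
import json
from collections import Counter, defaultdict


class ScenarioStore:
    """Reduce-by-key store: one merged histogram per scenario key."""

    def __init__(self):
        self._hist = defaultdict(Counter)

    def merge(self, scenario_key, local_histogram):
        """Fold one logger's local histogram into its scenario dataset."""
        self._hist[scenario_key].update(local_histogram)

    def dump(self, scenario_key):
        """Serialize one scenario dataset as JSON text."""
        payload = {
            "scenario": list(scenario_key),
            # JSON object keys must be strings.
            "histogram": {str(value): count for value, count
                          in sorted(self._hist[scenario_key].items())},
        }
        return json.dumps(payload)


# Two loggers report on the scenario fleet_size=40, one on fleet_size=50.
scenario_store = ScenarioStore()
scenario_store.merge((40,), {100: 2, 150: 1})
scenario_store.merge((40,), {100: 1})
scenario_store.merge((50,), {100: 5})
text = scenario_store.dump((40,))
```

Writing `text` to one file per scenario would give the query engine a format it can load selectively, based on the exploration attribute values in the key.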

Example

Consider a use case related to oil and gas platforms supply logistics. A simplification of the logistics process follows:

-   -   Platforms request needed materials from Controller Offices;
    -   Controller Offices generate Material Orders and send these orders to Warehouses;
    -   Warehouses process and pack these materials in Containers that can vary by size and send them by terrestrial transport to designated Ports;
    -   When Containers arrive in designated Ports, they are queued according to the urgency of the materials they contain;
    -   Containers are placed in Ships, by taking into consideration the Ship schedules; and
    -   Ships deliver materials to their Destination Platform according to their routes and return to Ports.

Software that simulates such a process would model the important entities mentioned above, such as Order, Warehouse, Port, Fleet, Container and Platform. Moreover, the simulation software defines the lists of all values that each entity may assume at a specific time; these lists of values are referred to as the domain of these entities.

During the simulation process, the software records, at each time instant, information about these and all other relevant entities that take part in the simulation, and generates frames of data for each of them.

Assume a user wants to run a simulation to predict whether reducing the fleet size will result in bottlenecks in ports. Currently, the fleet comprises 50 ships scattered among 10 ports, and the user wants to get an idea of how badly bottlenecks in ports start to show up should the fleet size be reduced by 10 ships. In this case, a bottleneck needs to be defined as an event. For instance, assume a bottleneck occurs when any port is at its maximum capacity for more than two days. Assume that a port reaches its maximum capacity at 1000 containers. The user also wants to determine which destination platforms will suffer the effect of bottlenecks the most.

In this case, the input to the system before the simulation process starts will be the following:

Key element attribute: lead_time;

Aggregate Attribute: destination_platform; and

Exploratory Attribute: fleet_size.

The bottleneck event would be defined as follows:

e(“bottleneck”) = ⋀₀⁴⁸ (number_containers_in_port > 1000).
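Read as a conjunction over time steps 0 through 48, the bottleneck definition holds at a start time when the container count stays above the threshold for every step of the window. The sketch below assumes one sample per hour for a single port, so that 48 steps span two days; the sampling rate, the function name and the sample series are assumptions.

```python
def bottleneck_starts(containers_per_hour, threshold=1000, window=48):
    """Return the start indices s such that the count exceeds `threshold`
    at every step s, s+1, ..., s+window (window + 1 consecutive samples)."""
    starts = []
    run = 0  # length of the current run of samples above the threshold
    for t, count in enumerate(containers_per_hour):
        run = run + 1 if count > threshold else 0
        if run >= window + 1:
            starts.append(t - window)
    return starts


# Hypothetical series: 5 quiet hours, 49 overloaded hours, 5 quiet hours.
series = [900] * 5 + [1200] * 49 + [900] * 5
starts = bottleneck_starts(series)
```

In a first level summary, the returned indices would play the role of the event's timestamps of occurrence for the execution.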

Once the simulation is finished, the needed summarizations that allow for this query to be performed are already done and the original data frames can be discarded. In a hypothetical SQL-like construct, the queries will look like the following example:

SELECT get_distribution_probability(e(‘bottleneck’))

FROM <simulation_run#N>

WHERE ‘fleet_size’ IN [40]

The result of the query will be a distribution probability for the occurrence of a bottleneck at the port. If the user wants to investigate further the consequences of the bottleneck, the user may evaluate the probability distribution of lead times per destination with the following query:

SELECT ‘destination_platform’, get_distribution_probability(‘lead_time’)

FROM <simulation_run#N>

WHERE ‘fleet_size’ IN [40]

GROUP BY ‘destination_platform’

The result of the query will be a distribution probability of the lead time for each destination_platform. Based on the answers, the user can analyze to what extent the reduction of the number of ships influences the probability of bottlenecks and which destinations are more affected by this reduction.

Conclusion

Among other benefits, aspects of the present invention summarize and query data generated by data-intensive simulations. Simulations of complex systems usually generate large amounts of data that need to be managed and analyzed in order to efficiently answer queries related to multiple different simulation scenarios. In one or more embodiments, a data summarization method for Discrete-time Simulation applications is provided in which a large number of scenarios are simulated and queries related to probabilities are executed. Simulation results are summarized substantially on-the-fly in order to save storage and improve subsequent query response times. Queries to be answered comprise distribution probabilities of key features, probabilities of critical events and probabilities of causality between events. In addition, these queries specify the set of scenarios that should be considered when probabilities are computed. Such a set can be any subset of the simulated scenarios. One challenge is the summarization of results in such a way that they can be accurately and efficiently combined to answer the queries. In one or more embodiments, summaries are computed substantially in parallel with the simulations using both local and remote memory resources. Summaries contain only the desirable amount of information for answering the possible future queries and are structured to allow for an efficient computation of the probabilities.

It should therefore be understood that in other embodiments different arrangements of additional or alternative elements may be used. At least a subset of these elements may be collectively implemented on a common processing platform or each such element may be implemented on a separate processing platform.

Also, numerous other arrangements of computers, servers, storage devices or other components are possible in the exemplary computing environment. Such components can communicate with other elements of the system over any type of network or other communication media.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It is to be appreciated that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

As further described herein, such computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks. Accordingly, as further detailed below, at least one embodiment of the invention includes an article of manufacture tangibly embodying computer readable instructions which, when implemented, cause a computer to carry out techniques described herein. An article of manufacture, a computer program product or a computer readable storage medium, as used herein, is not to be construed as being transitory signals, such as electromagnetic waves.

The computer program instructions may also be loaded onto a computer or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, component, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should be noted that the functions noted in the block may occur out of the order noted in the figures.

Accordingly, the techniques described herein can include providing a system, wherein the system includes distinct software modules, each being embodied on a tangible computer-readable recordable storage medium (for example, all modules embodied on the same medium, or each module embodied on a different medium). The modules can run, for example, on a hardware processor, and the techniques detailed herein can be carried out using the distinct software modules of the system executing on a hardware processor.

Additionally, the techniques detailed herein can also be implemented via a computer program product that includes computer useable program code stored in a computer readable storage medium in a data processing system, wherein the computer useable program code was downloaded over a network from a remote data processing system. The computer program product can also include, for example, computer useable program code that is stored in a computer readable storage medium in a server data processing system, wherein the computer useable program code is downloaded over a network to a remote data processing system for use in a computer readable storage medium with the remote system.

As will be appreciated by one skilled in the art, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “module” or “system.”

An aspect of the invention or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform the techniques detailed herein. Also, as described herein, aspects of the present invention may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon.

By way of example, an aspect of the present invention can make use of software running on a general purpose computer. FIG. 8 is a system diagram of an exemplary computer system on which at least one embodiment of the invention can be implemented. As depicted in FIG. 8, an example implementation employs, for example, a processor 802, a memory 804, and an input/output interface formed, for example, by a display 806 and a keyboard 808. The term “processor” as used herein includes any processing device(s), such as, for example, one that includes a central processing unit (CPU) and/or other forms of processing circuitry. The term “memory” includes memory associated with a processor or CPU, such as, for example, random access memory (RAM), read only memory (ROM), a fixed memory device (for example, a hard drive), a removable memory device (for example, a diskette), a flash memory, etc. Further, the phrase “input/output interface,” as used herein, includes a mechanism for inputting data to the processing unit (for example, a mouse) and a mechanism for providing results associated with the processing unit (for example, a printer).

The processor 802, memory 804, and input/output interface such as display 806 and keyboard 808 can be interconnected, for example, via bus 810 as part of a data processing unit 812. Suitable interconnections via bus 810 can also be provided to a network interface 814 (such as a network card), which can be provided to interface with a computer network, and to a media interface 816 (such as a diskette or compact disc read-only memory (CD-ROM) drive), which can be provided to interface with media 818.

Accordingly, computer software including instructions or code for carrying out the techniques detailed herein can be stored in associated memory devices (for example, ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (for example, into RAM) and implemented by a CPU. Such software can include firmware, resident software, microcode, etc.

As noted above, a data processing system suitable for storing and/or executing program code includes at least one processor 802 coupled directly or indirectly to memory elements 804 through a system bus 810. The memory elements can include local memory employed during actual implementation of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during implementation. Also, input/output (I/O) devices such as keyboards 808, displays 806, and pointing devices, can be coupled to the system either directly (such as via bus 810) or through intervening I/O controllers.

Network adapters such as network interface 814 (for example, a modem, a cable modem or an Ethernet card) can also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks.

As used herein, a “server” includes a physical data processing system (such as system 812 as depicted in FIG. 8) running a server program. It will be understood that such a physical server may or may not include a display and keyboard.

As noted, at least one embodiment of the invention can take the form ofa computer program product embodied in a computer readable medium havingcomputer readable program code embodied thereon. As will be appreciated,any combination of computer readable media may be utilized. The computerreadable medium can include a computer readable signal medium or acomputer readable storage medium. A computer readable storage medium maybe, for example, but not limited to, an electronic, magnetic, optical,electromagnetic, infrared, or semiconductor system, apparatus, ordevice, or any suitable combination of the foregoing. Examples includean electrical connection having one or more wires, a portable computerdiskette, a hard disk, RAM, ROM, an erasable programmable read-onlymemory (EPROM), flash memory, an optical fiber, a portable CD-ROM, anoptical storage device, a magnetic storage device, and/or any suitablecombination of the foregoing. More generally, a computer readablestorage medium may be any tangible medium that can contain, or store aprogram for use by or in connection with an instruction executionsystem, apparatus, or device.

Additionally, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms such as, for example, electro-magnetic, optical, or a suitable combination thereof. More generally, a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium can be transmitted using an appropriate medium such as, for example, wireless, wireline, optical fiber cable, radio frequency (RF), and/or a suitable combination of the foregoing. Computer program code for carrying out operations in accordance with one or more embodiments of the invention can be written in any combination of at least one programming language, including an object oriented programming language, and conventional procedural programming languages. The program code may execute entirely on a user's computer, partly on a user's computer, as a stand-alone software package, partly on a user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

In light of the above descriptions, it should be understood that the components illustrated herein can be implemented in various forms of hardware, software, or combinations thereof, for example, application specific integrated circuit(s) (ASICS), functional circuitry, an appropriately programmed general purpose digital computer with associated memory, etc.

Terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. For example, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless clearly indicated otherwise. It will be further understood that the terms “comprises” and/or “comprising,” as used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of another feature, integer, step, operation, element, component, and/or group thereof. Additionally, the corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed.

Also, it should again be emphasized that the above-described embodiments of the invention are presented for purposes of illustration only. Many variations and other alternative embodiments may be used. For example, the techniques are applicable to a wide variety of other types of communication systems, storage systems and processing devices that can benefit from improved summarization and querying of simulation data. Accordingly, the particular illustrative configurations of system and device elements detailed herein can be varied in other embodiments. These and numerous other alternative embodiments within the scope of the appended claims will be readily apparent to those skilled in the art.
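
By way of a further non-limiting illustration, the three summarization levels for a single key attribute can be sketched in Python as shown below. This is a minimal sketch under stated assumptions, not the implementation of any embodiment: the function names, the exploration attribute `trucks`, the sample values, and the equal weighting of the selected scenarios in the third level summary are all hypothetical choices made for the example. Events and causality relationships are omitted for brevity.

```python
from collections import Counter


def first_level_summary(values):
    """One execution: frequency of each observed key attribute value."""
    return Counter(values)


def second_level_summary(first_level_summaries):
    """One scenario: pool per-execution counts into a probability distribution."""
    pooled = Counter()
    for summary in first_level_summaries:
        pooled.update(summary)  # adds the counts of each execution
    total = sum(pooled.values())
    return {value: count / total for value, count in pooled.items()}


def third_level_summary(scenario_summaries, ranges):
    """Query time: aggregate the second level summaries of selected scenarios.

    `ranges` maps an exploration attribute name to an inclusive (low, high)
    interval. Assumption: selected scenarios are weighted equally.
    """
    selected = [
        dist
        for attrs, dist in scenario_summaries
        if all(low <= attrs[name] <= high for name, (low, high) in ranges.items())
    ]
    if not selected:
        return {}
    combined = {}
    for dist in selected:
        for value, probability in dist.items():
            combined[value] = combined.get(value, 0.0) + probability / len(selected)
    return combined


# Hypothetical source data: (exploration attributes, executions), where each
# execution is a list of observed values of one key attribute.
raw = [
    ({"trucks": 2}, [[1, 1, 2], [2, 2, 2]]),
    ({"trucks": 4}, [[1, 1], [1, 3]]),
    ({"trucks": 8}, [[5]]),
]

scenario_summaries = [
    (attrs, second_level_summary([first_level_summary(run) for run in runs]))
    for attrs, runs in raw
]
del raw  # queries below are answered from the summaries alone

# Query restricted to scenarios with 2 <= trucks <= 4.
answer = third_level_summary(scenario_summaries, {"trucks": (2, 4)})
```

In this sketch the query touches only the per-scenario distributions, so its cost depends on the number of selected scenarios rather than on the volume of data produced by the executions.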

What is claimed is:
 1. A method, comprising the steps of: obtaining a first level summary for each execution of a simulation of a plurality of scenarios of a combinatorial process, wherein each of said plurality of scenarios corresponds to a distinct combination of exploration attributes, wherein said simulation comprises a combination of said exploration attributes comprising a plurality of independent variables that are varied during said simulation and key attributes of said combinatorial process that are a target of said simulation, and wherein a user has provided, prior to a time of the simulation, an indication of simulation data that will be queried following said simulation, wherein said simulation data that will be queried comprises one or more of (i) a set of said key attributes, (ii) a set of events, and (iii) a set of causality relationships between a plurality of said events, wherein each of said first level summaries comprise one or more of (i) a summary of said key attributes indicating a frequency distribution of each attribute value in the one or more of said key attributes, (ii) a timestamp of occurrences of each of said events, and (iii) an indication of whether said causality between said plurality of said events is observed during said simulation; obtaining a second level summary for each of said scenarios, wherein each of said second level summaries summarizes one or more executions of said given scenario and comprises one or more of (i) a consolidated distribution probability for each of said key attributes based on an aggregation of the frequency distribution of each attribute value in the first level summaries, (ii) a frequency distribution of occurrences of each of said events in a single execution based on the timestamp of occurrences over time from the first level summaries, and (iii) a frequency of observations of the causality between each pair of said events based on the indication of whether said causality between said plurality of said events is observed during said simulation from the first level summaries, wherein the first level summary and the second level summary comprise less data than a source data for each of the executions of the simulation of the plurality of scenarios of the combinatorial process and wherein the source data generated by each of the executions of the simulation of the plurality of scenarios of the combinatorial process used to generate one or more of the first level summary and the second level summary is discarded, by sending an instruction to at least one memory device that stores the source data, responsive to at least one of the first level summary and the second level summary being generated; and in response to a user query that includes one or more ranges of exploration attributes that restrict the user query to a specific set of selected scenarios to be considered, based on said indication of said simulation data that will be queried following said simulation, and one or more of (i) said key attributes, (ii) said events, and (iii) said causality between a plurality of said events, performing the following steps: interpreting said user query; accessing second level summaries, without accessing the source data, of said selected scenarios to retrieve the information related to said key attributes, events and causality expressed in the query; and producing as a query response output a third level summary that aggregates the information accessed from said second level summaries of said selected scenarios and contains one or more of (i) probability distribution functions of said key attributes, (ii) probability distribution functions of the number of occurrences of said events, and (iii) composed probabilities of the causality relationships between said events, wherein the method is performed by at least one processing device comprising a processor coupled to a second memory.
 2. The method of claim 1, wherein said first level summary and said second level summary are generated during said simulation, and wherein said second level summaries are subsequently used to generate said third level summaries in response to one or more of said user queries.
 3. The method of claim 1, wherein said simulation data that will be queried further comprises one or more hierarchies of one or more aggregation attributes that group one or more of said key attributes at all summarization levels.
 4. The method of claim 3, further comprising the step of storing sub-totals of said key attributes for each level of said hierarchy of aggregation attributes.
 5. The method of claim 1, wherein said key attributes comprise one or more key element attributes representing attributes of the elements processed by said simulation and one or more key temporal attributes representing properties assigned to a time instant of said simulation.
 6. The method of claim 3, wherein said step of interpreting said user query further comprises extracting one or more of (i) desired key attributes, (ii) a subset of valid values or intervals for each of said exploration attributes, and (iii) a subset of valid values or intervals for each of said aggregation attributes as defined by the user query.
 7. The method of claim 6, wherein said step of interpreting said user query further comprises extracting one or more of (ii) the subset of valid values or intervals for each of said exploration attributes, and (iii) the subset of valid values or intervals for each of said aggregation attributes as defined by the user query and wherein said step of accessing second level summaries of said selected scenarios to retrieve the information is based on said subset of valid values or intervals.
 8. The method of claim 1, wherein said simulation occurs in parallel among one or more compute nodes on a distributed computing infrastructure, and wherein each one of the first level summary, the second level summary and the third level summary are generated in parallel among one or more compute nodes of said distributed computing infrastructure.
 9. The method of claim 1, wherein said first level summaries and second level summaries are computed using volatile in-memory storage and subsequently persisted in non-volatile disk storage for future use.
 10. A computer program product, comprising a non-transitory machine-readable storage medium having encoded therein executable code of one or more software programs, wherein the one or more software programs when executed by at least one processing device perform the following steps: obtaining a first level summary for each execution of a simulation of a plurality of scenarios of a combinatorial process, wherein each of said plurality of scenarios corresponds to a distinct combination of exploration attributes, wherein said simulation comprises a combination of said exploration attributes comprising a plurality of independent variables that are varied during said simulation and key attributes of said combinatorial process that are a target of said simulation, and wherein a user has provided, prior to a time of the simulation, an indication of simulation data that will be queried following said simulation, wherein said simulation data that will be queried comprises one or more of (i) a set of said key attributes, (ii) a set of events, and (iii) a set of causality relationships between a plurality of said events, wherein each of said first level summaries comprise one or more of (i) a summary of said key attributes indicating a frequency distribution of each attribute value in the one or more of said key attributes, (ii) a timestamp of occurrences of each of said events, and (iii) an indication of whether said causality between said plurality of said events is observed during said simulation; obtaining a second level summary for each of said scenarios, wherein each of said second level summaries summarizes one or more executions of said given scenario and comprises one or more of (i) a consolidated distribution probability for each of said key attributes based on an aggregation of the frequency distribution of each attribute value in the first level summaries, (ii) a frequency distribution of occurrences of each of said events in a single execution based on the timestamp of occurrences over time from the first level summaries, and (iii) a frequency of observations of the causality between each pair of said events based on the indication of whether said causality between said plurality of said events is observed during said simulation from the first level summaries, wherein the first level summary and the second level summary comprise less data than a source data for each of the executions of the simulation of the plurality of scenarios of the combinatorial process and wherein the source data generated by each of the executions of the simulation of the plurality of scenarios of the combinatorial process used to generate one or more of the first level summary and the second level summary is discarded, by sending an instruction to at least one memory device that stores the source data, responsive to at least one of the first level summary and the second level summary being generated; and in response to a user query that includes one or more ranges of exploration attributes that restrict the user query to a specific set of selected scenarios to be considered, based on said indication of said simulation data that will be queried following said simulation, and one or more of (i) said key attributes, (ii) said events, and (iii) said causality between a plurality of said events, performing the following steps: interpreting said user query; accessing second level summaries, without accessing the source data, of said selected scenarios to retrieve the information related to said key attributes, events and causality expressed in the query; and producing as a query response output a third level summary that aggregates the information accessed from said second level summaries of said selected scenarios and contains one or more of (i) probability distribution functions of said key attributes, (ii) probability distribution functions of the number of occurrences of said events, and (iii) composed probabilities of the causality relationships between said events.
 11. The computer program product of claim 10, wherein said first level summary and said second level summary are generated during said simulation and wherein said second level summaries are subsequently used to generate said third level summaries in response to one or more of said user queries.
 12. The computer program product of claim 10, wherein said simulation data that will be queried further comprises one or more hierarchies of one or more aggregation attributes that group one or more of said key attributes at all summarization levels.
 13. The computer program product of claim 12, wherein said step of interpreting said user query further comprises extracting one or more of (i) desired key attributes, (ii) a subset of valid values or intervals for each of said exploration attributes, and (iii) a subset of valid values or intervals for each of said aggregation attributes as defined by the user query.
 14. The computer program product of claim 13, wherein said step of interpreting said user query further comprises extracting one or more of (ii) the subset of valid values or intervals for each of said exploration attributes, and (iii) the subset of valid values or intervals for each of said aggregation attributes as defined by the user query and wherein said step of accessing second level summaries of said selected scenarios to retrieve the information is based on said subset of valid values or intervals.
 15. A system, comprising: a first memory; and at least one processing device, coupled to the memory, operative to implement the following steps: obtaining a first level summary for each execution of a simulation of a plurality of scenarios of a combinatorial process, wherein each of said plurality of scenarios corresponds to a distinct combination of exploration attributes, wherein said simulation comprises a combination of said exploration attributes comprising a plurality of independent variables that are varied during said simulation and key attributes of said combinatorial process that are a target of said simulation, and wherein a user has provided, prior to a time of the simulation, an indication of simulation data that will be queried following said simulation, wherein said simulation data that will be queried comprises one or more of (i) a set of said key attributes, (ii) a set of events, and (iii) a set of causality relationships between a plurality of said events, wherein each of said first level summaries comprise one or more of (i) a summary of said key attributes indicating a frequency distribution of each attribute value in the one or more of said key attributes, (ii) a timestamp of occurrences of each of said events, and (iii) an indication of whether said causality between said plurality of said events is observed during said simulation; obtaining a second level summary for each of said scenarios, wherein each of said second level summaries summarizes one or more executions of said given scenario and comprises one or more of (i) a consolidated distribution probability for each of said key attributes based on an aggregation of the frequency distribution of each attribute value in the first level summaries, (ii) a frequency distribution of occurrences of each of said events in a single execution based on the timestamp of occurrences over time from the first level summaries, and (iii) a frequency of observations of the causality between each pair of said events based on the indication of whether said causality between said plurality of said events is observed during said simulation from the first level summaries, wherein the first level summary and the second level summary comprise less data than a source data for each of the executions of the simulation of the plurality of scenarios of the combinatorial process and wherein the source data generated by each of the executions of the simulation of the plurality of scenarios of the combinatorial process used to generate one or more of the first level summary and the second level summary is discarded, by sending an instruction to at least one second memory device that stores the source data, responsive to at least one of the first level summary and the second level summary being generated; and in response to a user query that includes one or more ranges of exploration attributes that restrict the user query to a specific set of selected scenarios to be considered, based on said indication of said simulation data that will be queried following said simulation, and one or more of (i) said key attributes, (ii) said events, and (iii) said causality between a plurality of said events, performing the following steps: interpreting said user query; accessing second level summaries, without accessing the source data, of said selected scenarios to retrieve the information related to said key attributes, events and causality expressed in the query; and producing as a query response output a third level summary that aggregates the information accessed from said second level summaries of said selected scenarios and contains one or more of (i) probability distribution functions of said key attributes, (ii) probability distribution functions of the number of occurrences of said events, and (iii) composed probabilities of the causality relationships between said events.
 16. The system of claim 15, wherein said first level summary and said second level summary are generated during said simulation and wherein said second level summaries are subsequently used to generate said third level summaries in response to one or more of said user queries.
 17. The system of claim 15, wherein said simulation data that will be queried further comprises one or more hierarchies of one or more aggregation attributes that group one or more of said key attributes at all summarization levels.
 18. The system of claim 17, wherein said step of interpreting said user query further comprises extracting one or more of (i) desired key attributes, (ii) a subset of valid values or intervals for each of said exploration attributes, and (iii) a subset of valid values or intervals for each of said aggregation attributes as defined by the user query.
 19. The system of claim 15, wherein said simulation occurs in parallel among one or more compute nodes on a distributed computing infrastructure, and wherein each one of the first level summary, the second level summary and the third level summary are generated in parallel among one or more compute nodes of said distributed computing infrastructure.
 20. The system of claim 15, wherein said first level summaries and second level summaries are computed using volatile in-memory storage and subsequently persisted in non-volatile disk storage for future use. 